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We introduce a class of dimension reduction estimators based on 
an ensemble of the minimum average variance estimates of functions 

l/~) ' that characterize the central subspace, such as the characteristic func- 

tions, the Box-Cox transformations and wavelet basis. The ensemble 
estimators exhaustively estimate the central subspace without im- 
posing restrictive conditions on the predictors, and have the same 
convergence rate as the minimum average variance estimates. They 
are flexible and easy to implement, and allow repeated use of the 

i-Q , available sample, which enhances accuracy. They are applicable to 

jrt ■ both univariate and multivariate responses in a unified form. We es- 

tablish the consistency and convergence rate of these estimators, and 
the consistency of a cross validation criterion for order determination. 
We compare the ensemble estimators with other estimators in a wide 
variety of models, and establish their competent performance. 
> 






1. Introduction. Sufficient dimension reduction [Li (1991, 1992), Cook 
and Weisberg (1991), Cook (1994, 1996)] is a methodology for reducing the 
dimension of predictors while preserving its regression relation with a re- 
sponse. The reduction is achieved by projecting the raw predictors on to 
^j ■ a lower-dimensional subspace. Let (X,Y) be a pair of random vectors of 

dimensions p and s. In this section we tentatively assume s = 1. Let S de- 
note a subspace of M p , and let P$ denote the orthogonal projection on to S. 
If Y and X are independent conditioning on PsX, then PsX can be used 
as the predictor without loss of regression information. Such subspaces S 
are called dimension reduction subspaces. The intersection of all such sub- 
spaces S, if itself satisfies the conditional independence, is called the central 
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subspace [Cook (1994)], and is denoted by Sy\x- Under mild conditions 
[Cook (1996), Yin, Li and Cook (2008)], the central subspace is well denned 
and is unique. 

A closely related concept is the notion of central mean subspace [Cook and 
Li (2002)], which is the intersection of all subspaces such that E(Y\X) = 
E(Y\PgX). This subspace is written as S^(y\x)- Evidently, if conditional 
distribution of Y given X depends on X only through E(Y\X), then Sy\x = 
Se(y\x)- However, if this conditional distribution also depends on other func- 
tions of X, such as var(Y|X), then S E {y\x) is a proper subspace of Sy\x- 
Cook and Li (2002) noted that several previously introduced dimension re- 
duction methods, such as the ordinary least squares [Li and Duan (1989), 
Duan and Li (1991)] and principal Hessian directions [Li (1992), Cook 
(1998)], actually estimates the central mean subspaces; whereas some other 
pre-existing estimates, such as the sliced inverse regression (SIR), the SIR- 
II [Li (1991)] and the sliced average variance estimator (SAVE) [Cook and 
Weisberg (1991)], can recover additional directions in the central subspace. 

Yin and Cook (2002) extended central mean subspace to central moment 
subspace, based on the relation E(Y \X) = E(Y \P$X), which is written 
as S E i Y k \x)- This provides us with a graduation between the central mean 
subspace and the central subspace. That is, for sufficiently large k, the 
subspace spanned by {S E ^ Y l \x)i i = l,---,k} approaches the central sub- 
space. Zhu and Zeng (2006) showed that the central mean subspaces for 
E(e \X),t e R, when put together, recovers the central subspace, and ex- 
ploited this relation to develop a Fourier transformation method to estimate 
the central subspace. Here and throughout, we use t to denote the imag- 
inary unit \J — 1. More recently, Zeng and Zhu (2010) developed a general 
integral transform method. Both papers hint at the following fact: if one 
can estimate the central mean subspace of £/[/(X)|y] for sufficiently many 
functions /, then one can recover the central subspace. 

In a seminal paper, Xia et al. (2002) introduced a dimension reduction 
method, called the minimum average variance estimator (MAVE), based 
on estimation of the gradient of the conditional expectation E(Y\X). This 
method has three main advantages: (1) it estimates the central mean sub- 
spaces exhaustively; (2) it does not impose strong assumptions on the distri- 
bution of X; (3) its computation can be broken down into iterations between 
two quadratic optimization steps, each of which having an explicit solution. 
However, a drawback of this method is that it cannot estimate directions 
outside the central mean subspace. For example, it cannot recover direc- 
tions in the conditional variance function var(Y|X). To remedy this defi- 
ciency, Xia (2007) and Wang and Xia (2008), respectively, proposed density 
MAVE (DMAVE) and sliced regression (SR) that can exhaustively estimate 
the central subspace. The former is based the estimation of the gradients of 
the functions E{H[(Y — a)/b]\X}, where H is a known probability density 
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function, a G (—00,00) and b £ (0, 00). The latter is based on the estimation 
of the gradients of the functions E[I(Y < c)\X], where c is an arbitrary con- 
stant. Here, again, we see the echo of the same basic fact that estimating the 
central mean subspaces for a rich enough family of functions is equivalent 
to estimating the central subspace itself. 

The ensemble approach introduced in this paper is based on the same 
fact, but it is more general, more flexible and, in the numerical examples we 
considered, more efficient. In broad outlines the procedure can be described 
as follows. Consider a general family g of functions of Y. For each / G $, 
let SE[f(Y)\x] denote the central mean subspace for the conditional mean 
E[f(Y)\X]. We say that g" characterizes the central subspace if the subspace 
spanned by the collection of subspaces {<Sm/(y)|x] : / £ -$} 1S equal to the 
central subspace. We introduce a probability measure on g, and randomly 
sample functions f\ , . . . , f m from g according to this probability. We then 
assemble the central mean subspaces SE[f e (Y)\x]i £= I,- ■ ■ ,m, together to 
recover the central subspace. 

In principle, the ensemble approach can be used in conjunction with any 
estimators of the central mean subspace to recover the central subspace, 
such as the ordinary least squares, the principal Hessian directions, MAVE 
and its two variants: the outer product of gradients (OPG) and the refined 
MAVE (RMAVE). In this paper we focus on its combination with MAVE 
and its variants, and refer to this combination the MAVE (OPG or RMAVE) 
ensemble. We show that these ensemble estimators exhaustively estimate the 
central subspace and that the RMAVE ensemble has the same convergence 
rate as RMAVE itself. We also introduce a cross validation criterion to de- 
termine the dimension of the central subspace, and establish its consistency. 
Through a number of simulation experiments, most of which are based on 
published models, we demonstrate the superb performance of the RMAVE 
ensemble based on the family g = \e itY : t € R}. We also explore other char- 
acterizing families, such as the Box-Cox transformations and wavelet basis. 

The rest of the paper is organized as follows. In Section 2 we investigate 
what types of family g can characterize the central subspace. We introduce 
the MAVE ensemble in Section 3 and outline the parallel developments 
for OPG ensemble and the RMAVE ensemble in Section 4. In Section 5 we 
introduce a cross validation criterion for order determination and discuss the 
choices of the characterizing family J, with emphasis on the characteristic 
function and the Box-Cox transformations. In Section 6 we establish the 
consistency and derive the convergence rate of the RMAVE ensemble, and 
establish the consistency of the cross validation estimator. In Section 7 we 
conduct simulation comparisons between the RMAVE ensemble and other 
estimators in a large variety of models. Some concluding remarks are made 
in Section 8. 
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2. Characterizing the central subspace. The basic fact that underlies 
our approach is that the dimension reduction subspaces for the conditional 
means £ , [/(K)|X], when combined in unison, can recover the dimension 
reduction subspace for Y versus X. For this idea to work, the family of / 
needs to be sufficiently rich, and in this section we rigorously pose and study 
this characterization problem. 

Let X be a p-dimensional random vector defined on £lx and Y be an 
s-dimensional random vector defined on Qy Let 5 be a family of functions 
/ : f2y —> F, where F can be the set of real numbers M. or complex numbers C. 
Let SE\f(Y)\x] denote the central mean subspace for the conditional mean 
E[f(Y)\X], as defined in Cook and Li (2002) and Yin and Cook (2002). 
That is, SE[f(Y)\X] is the intersection of all subspaces of W such that 

(1) E[f(Y)\X]=E[f(Y)\P s X]. 

Let Sy\x denote the central subspace of Y versus X as defined in Cook 
(1994). That is, Sy\x is the intersection of all linear subspaces of W such 
that 

(2) YMX\P S X. 

Note that here we do allow Y to be a random vector; whereas the mentioned 
previous works assume Y to be a scalar. This relaxation is made possible by 
the transformation /, which takes value in the scalar field F. 

Definition 2.1. Let 5 be a family of measurable F-valued functions 
defined on Qy If 

(3) span{S E[f ^ Y )\x]-f^d} = S Y \x, 

then we say the family $ characterizes the central subspace. 

Let Fy denote the distribution of Y, and let Z^-Fy) be the class of func- 
tions f(Y) with finite variances, together with the inner product (/i,/2) = 
E[f 1 (Y)f 2 (Y)]. Let Li(Fy) be the class of functions f(Y) such that 
E\f(Y)\ < oo, together with the norm E\f(Y)\. We denote the subspace on 
the left-hand side of (3) by S($). Note that E[f(Y)\X] is finite if / € Lx{F Y ). 

Lemma 2.1. Suppose that $QLi(Fy)- Then the following assertions 
hold: 

(1) S($)QSy\ X . 

(2) If (1) being satisfied for all f €# implies (2), then Sy\x f= <->($)• 

Before proving this lemma we first note the following fact. If S, S\ and S2 
are linear subspaces of M p , then 

(4) S\ C £2 if and only if {5:5 contains 52} C {5 : 5 contains 5i}. 
This can be easily seen by taking intersection on both sides of the equality. 
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Proof of Lemma 2.1. (1) Let 5 be a subspace of W that contains S Y \x- 
Then (2) holds, and consequently (1) holds for all f £$. This implies that 5 
contains S B [f(y)\x] f° r an f £$■ Since 5 is a linear subspace, it must con- 
tain 5(5 r ). Hence 

{5 : 5 contains Sy\x} ^{5:5 contains 5(5)}, 

which, by (4), proves part 1. 

(2) Let 5 be a subspace of MP that contains S($). Then (1) holds for all 
/£j. By assumption this implies (2), and consequently 5 contains Sy\x- 
Hence 

{5:5 contains 5 (#)} C {5:5 contains Sy\x], 

which, by (4), implies S Y \x Q 5(5)- d 

Let 53 be the family of measurable indicator functions of Y . That is, 
<B = {I B : B is a Borel set in Oy}. Note that 23 C L 2 (F Y ). 

Theorem 2.1. If $ is a subset of L 2 {Fy) that is dense in 53, then 5 
characterizes the central subspace. 

Proof. Because 5 is a subset of L 2 (Fy), it is also a subset of L\{Fy). 
Hence, by Lemma 2.1, it suffices to show that (1) being satisfied for all / £ J 
implies (2). 

Let 5 be a subspace such that (1) holds for all / G $, and let B be a Borel 
set in Uy ■ Because $ is dense in !B there is a sequence {fk} CJ such that 
lim^^ E[I B (Y) - fk(Y)] 2 = 0. For any g G L 2 {F X ) we have 

£{<7pO[/ B (Y) - E{I B {Y)\P S X)]} 

(5) = £%(X)[I B (r) - E(f k (Y)\P s X)}} 

+ E{g(X)E[f k (Y)-I B (Y)\P s X}}. 
The square of the second term on the right is no more than 

E[g 2 (X)]E{E 2 [f k (Y)-I B (Y)\P s X]}<E[g 2 (X)}E{E[f k (Y)-I B (Y)] 2 }^0. 

Since /^Jwe have E[f k (Y)\P s X] = E[f k {Y)\X). Hence the first term on 
the right-hand side of (5) can be rewritten as 

(6) E{g{X)[I B {Y)-E(I B {Y)\X)]}+E{g{X)E[I B (Y)-f k {Y)\X}}. 

The first term is by the definition of conditional expectation. The square 
of the second term in (6) is no more than 

E[g 2 (X)]E{E 2 [I B (Y) - f k (Y)\X]} < E[g 2 (X)]E{[I B (Y) - f k (Y)] 2 } -+ 0. 

Since the left-hand side of (5) does not depend on k, and the right-hand side 
converges toOas^oo, we have E{g(X)[I B (Y) - E(I B (Y)\P S X)}} = 0. 
By the definition of conditional expectation the above being true for all 
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9 € L 2 (F X ) implies E[I B {Y)\P S X)} = E[I B (Y)\X] almost surely. Since B is 
an arbitrary Borel set in Qx, this implies Y _U_ X\P$X . □ 

This theorem synthesizes several recently developed methods in the lit- 
erature, and also anticipates useful new ways to combine central mean sub- 
spaces into the central subspace. The following examples demonstrate its 
potential. 

Example 2.1 (Polynomials). Let$={Y t :t = l,2, ...}. Then S E if(Y)\x] = 
Se(y ± \x)- This is the type of dimension reduction subspaces studied by Cook 
and Li (2002), Yin and Cook (2002) and more recently Zhu and Zhu (2009). 
If the conditional moment generating function E(e \X) is finite in an open 
interval that contains 0, then 5 is dense in L2(Fy), an d hence character- 
izes Sy\X- 

Example 2.2 (Kernel density). Let b > and H be a symmetric proba- 
bility density function defined on R. Let 5 = span{6~ 1 ff[(y — t)/b] : t G R, b G 
R + }. Xia (2007) proposed a DMAVE method that, in effect, recovers Sy\x 
by estimating Se[/(y)\x] f° r / £ 5- This family is dense in L2{Fy) when H 
is the normal density. See, for example, Fukumizu, Bach and Jordan (2009). 

Example 2.3 (Slices). Let £ = {/(-^(y) :t G R}. Then 5 is clearly 
dense in *B. The method proposed by Wang and Xia (2008) is based on the 
estimation of SE[f(Y)\x] f° r / i n this family. 

Example 2.4 (Box-Cox transformations). Let Y be a nonnegative ran- 
dom variable, and consider the family of transformations 

(V-i 

(7) f t (y) = )—T> ^°' 

{\og{y), t = 0. 

This is the Box-Cox transformation [Box and Cox (1964)]. This family char- 
acterizes the central subspace because it contains the family in Example 2.1. 

Example 2.5 (Characteristic function). Let £ = W ty : t G R}, where t = 
\/— 1. Note that £(e t<y |X) is simply the conditional characteristic function 
of y | X. It is well known that this family is dense in L2{Fy)- It is used by Zhu 
and Zeng (2006) to recover the central mean subspace and central subspace, 
respectively, based on the assumption that X is multivariate normal. This 
family is also our focus when we implement the ensemble estimators. 

Example 2.6 (Haar wavelets). Let 

M, t€ [0,1/2), 

m=\-h t£ [1/2,1), 

1 0, otherwise. 
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Consider the family 5 = {1} U {ip{2 n y - k) : k = 0, . . . , 2 n - 1; n = 1, 2, . . .}, 
where the 1 in {1} represents the function of y that always takes the value 1. 
This is the famous Haar basis often used in wavelet estimators. See, for 
example, Donoho and Johnstone (1994) and Antoniadis and Fan (2001). 
The Haar basis is obviously dense in OS and hence characterizes the central 
subspace. 

In this paper we only consider parametric characterizing families 5- That 
is, 5" is of the form {ft : t £ ^t} where Qt is a subset of a Euclidean space M. q . 
All the characterizing families in the above examples are parametric. In 
the following, for a sequence of subspaces {Sk} and a subspace 5, we say 
linifc^oo 5fc = S if lim^ooH-Ps,. — Ps\\ = 0, where || • || is a matrix norm, such 
as the operator norm or the Frobenius matrix norm. The two norms are 
topologically equivalent, and makes no difference in asymptotic analysis. 
See, for example, Li, Zha and Chiaromonte (2005). Note that we are inter- 
ested in Sy\x y i a s P an {SE[f t (Y)\x] '■ t £ ^t}, then a question arises is whether 
we can recover Sy\x from a finite t € £It- Indeed, we can. Theorem 2.2 below 
demonstrates that, with probability 1, the central subspace can be charac- 
terized by a finite number of functions in a characterizing family. In essence, 
it relies on the following fact: if a sequence of subspaces S m converges to 
another subspace S from within, then the norm ||5 m — 5|| is discrete in 
nature; that is, if this norm converges to then it must be identically 
for large m. This phenomenon is also noticed in Yin, Li and Cook (2008). 
The next lemma, albeit simple, reveals this discrete nature of dimension 
reduction. 

Lemma 2.2. Let S± C 5 2 be two subspaces of MP. Then \\Ps 2 ~ Psi\\ is 
either or no less than 1. 

Proof. If S\ = <S 2 , then ||Ps 2 - P Sl \\ = 0. If S\ / <S 2 , then direct differ- 
ence 1S2 © S\ is nonempty. We know that in this case Ps 2 — Ps x = -Ps 2 eSi- 
Let v be a unit vector in 5 2 S\ . Then 

ii^ 2 e 5l ii 2 >n- T ii 2 =EE(^) 2 =i:^E^ = i. 

i=l j=l i=l j=l 

This completes the proof. □ 

Let Bq = (/3i, . . . , j3d ) be an orthogonal basis for the central subspace, Sy\x, 
whose dimension is oIq . In the following, we will randomly sample T\ , . . . , T m 
from £It- In this setting, we assume that these random elements are defined 
on a measurable space (fl,A). Then £It is interpreted as the range of the 
mapping Tj : Q — > f^. We denote a generic member of f2 by uj. 
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Theorem 2.2. Suppose that $ characterizes the central sub space, T\, 
Tii ■ ■ ■ is an i.i.d. sequence of random variables supported on Q,<r and, for each 
integer m, B{T\, . . . ,T m ) is an orthogonal basis matrix of sp&n{Smf T ,(Y)\X] '■ 
i = l,... ,m}. Then the following event has probability 1: 

{id G Q: there is an integer mo(w) such that, 

for all m > mo(cj),sp&n(B(Ti(cj), . . . ,T m (uj)) = span(i?o)}- 

Proof. For u = I,..., do, let A u be a subset of {t G Qt} such that 
j3 u $ s P &n {^E[ft(Y)\X]i't' £ Au}- If P(T G A u ) = 1 for some u, then ^ does 
not characterize Sy\Xi which is a contradiction. Hence P(T e A u ) < 1 for 
u = 1, . . . ,do. Let 

(8) 5(T U ...,T m ) = \\B(T U . . .,T m )B T (T 1 , . . . ,T m ) - B Bj\\. 

Note that S(T±, . . . ,T m ) ^ if and only if, for some u G {1, . . . ,do}, T\, . . . ,T m 
all belongs to A u . This is the event {J^ =1 PlfcLii^fc G A. u }, and has probability 

(do m \ do 

{JC]{T k GA u }\<Y^[P(T€A u )] m . 
u=lk=l / u=l 

Since 

oo do do p(rr A n 

m=lu=l u=l 

we have, by the first Borel-Cantelli lemma, with probability 1, 

(9) lim 5(T l ,...,T m ) = 0. 

Since span[-B(Ti, . . . ,T m )] C span(i?o)) by Lemma 2.2, event (9) occurs if and 
only if S(T\, . . . ,T m ) becomes for sufficiently large m. Thus, with proba- 
bility 1, there exists an mo(w) such that for m > mo(w), sp&n[B(Ti(uj), . . . , 
T m (tjj))] =span(_B )- □ 

3. MAVE ensemble. We first describe our method at the population 
level, and then develop the estimation procedure at the sample level. The 
idea underlying MAVE can be outlined as follows. Assume that the central 
mean subspace Se(y\x) has dimension d$ < p. Let f3 be a p x do matrix such 
that span(/3) =S E (y\x)- Then 

0E(Y\X = x)/dx = (3[dE{Y\/3 T X = u)/du]. 

Since the vector on the right always belongs to span(/3), we can recover 
span(/3) by estimating the gradient of E(Y\X = x). This is achieved by 
local linear regression. Let K^ be a probability density function defined 
on W where h is proportional to the square root of the largest eigenvalue 
of the variance matrix under K^. Let fx be the density of X. Consider the 
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objective function 

i E{[Y- a(x) - b T (x)B T (X - x)] 2 K h {X - x)}f x (x) dx, 
Jn x 

where oiflx^K.iiflx^R^^e W xd °. Let (a* h (-),b* h (-),B* h ) be the min- 
imizer of the above function over all possible functions a, b and constant 
matrices B, then it can be shown that lim/ l _>o s P an (-^/i) = Se(y\x)- See Xia 
et al. (2002). 

We now describe at the population level how to assemble a collection of 
MAVEs to recover the central subspace. Let 5 = {ft '■ t G £It} be a parametric 
characterizing family. Throughout we assume F = C, though the subsequent 
statements are true also for F = R by simply discarding the imaginary part. 
Let ft{y, 1) and ft(y,2) denote the real and imaginary parts of ft(y)- That 
is, ft(y) = ft(y, 1) + t/t(y, 2). Let T be a random vector defined on J7t, 
with distribution Ft- Applying the MAVE procedure to the transformed 
response ft(Y) and integrating with respect to the distribution Ft leads to 
the following population-level objective function: 

2 r 

J2 E{[f t (Y,£)-at(x) 

£ =1 Jn T xtt x 
(10) 

- bj(x)B T (X - x)] 2 K h (X - x)} dF x {x) dF T (t). 

We minimize this function over all R- valued functions a^(-), £ = 1,2, all 
R° -valued functions be(-), i = 1,2 and all p x do constant matrices B. 

At the sample level, suppose that (X\, Y"i), . . . , (X n , Y n ) are independent 
copies of (X, Y). Let Kq(-) be a symmetric probability density function 
define on R. For any v G W and h G R + , let K h (v) = h~ p K (\\v\\/h). Let 

n 

Wij {h) = K h (Xi -Xj)/J2 K h (X u -Xj). 

u=l 

Let Ti, . . . , T m be an independent sample from Ft- Mimicking (10) we min- 
imize the sample-level objective function 

2 m n n 

£=i /c=i j=i i=i 

over scalars {ajfc(£) : j = 1, ... ,n,k = 1, ... ,m,£ = 1,2}, do-dimensional vec- 
tors {bjk{t} : j = 1, ... ,n;k = I, ... ,m,£ = 1,2} and p x do matrices B. The 
coefficients {pj'-j = l,...,n} are trimming constants. Their purpose is to 
exclude those X's with too few observations around, which are unreliable. 
Let p : R — >• R be a function with a bounded second derivative such that 
p(y) > if v > v o and p(f) = if v < t>o, for some small Do > 0. We take 
/a,- = jo(n~ z27=i-Kh(Xi — Xj)). The bandwidth h is taken to be propor- 
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tional to n _1 '' p+4 ', which is the optimal bandwidth in the sense of mean 
integrated squared errors. For more details about the trimming constants 
and the bandwidth, see Xia et al. (2002), Fan, Yao and Cai (2003), Wang 
and Xia (2008). 

A rather appealing aspect of this procedure is that the minimization of the 
objective function (11) can be broken down into iterations between two steps, 
each of which is a quadratic optimization problem having an explicit solu- 
tion. More specifically, for a fixed B £ M pxrf °, minimize (11) over ajk(£),bjk(£) 
for j = 1, . . . ,n, k = 1, . . . ,m,£ = 1,2. Note that, for each triplet (j,k,£), the 
summand of (11), 
n 

(12) J2p j w ij (h)[f n (Yi,e) - a jk (£) - b] k (£)B T (X t - A,)] 2 

i=l 

depends on and only on ajk(£),bjk(£)- As a result, minimizing (11) jointly is 
equivalent to minimizing (12) individually. This is a least-squares problem 
whose solution is 

( p\ \ r n 

- X>i(%;Ay(i?)A;.(i?) 



bjk{f) 



^2w^p,A t ,(B)f Tk (Y t ) 



where A i:j (B) = (1, (X t - Aj) T £) T . 

For fixed a j /-(£), bj k (£), j = 1, . . . , n, k = 1, . . . , m, £ = 1, 2, the minimiza- 
tion of (11) is again a least-squares problem. The solution is 



T 



-1 



vec(B) = [J2pj"ij(h)(bjk(l) ® (X { - Xj))^^) <g> (X { - X 3 )) 

x [X>*MM') ® (*i ~ x 3 ))Un{Y u £) - a jk (£))] , 

where the summation is over 

(13) (i, j,k,£) G {1, ... ,n} x {1, ... ,n} x {1, ... ,m} x {1,2}. 

Thus, starting with an initial estimate of Bq of Sy\xi which, for example, 
can be the OPG ensemble described in the next section, we iterate between 
the above two steps until convergence. More specifically, let B^ r ' be the 
estimate at the rth iteration. We stop when ||-Pg( r ) — Pg( r+ i)|| is smaller 
than some preassigned constant, such as 10~ 6 . The subspace span(B^ r+l ') 
is the estimate of Sy\x- We call this procedure the MAVE ensemble and the 
integer m the ensemble size. 

4. Variations of MAVE ensemble. Besides MAVE, Xia et al. (2002) also 
introduced two companion estimators: the outer product of gradients (OPG) 
and a refinement of MAVE (RMAVE). The former only involves eigen de- 
compositions and is very easy to compute. It is in general less accurate than 
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MAVE, but can be used as an initial estimate for MAVE. The latter involves 
iterations of steps, each similar to MAVE. It is more accurate than MAVE, 
and can take MAVE as its initial estimate. In this section we develop parallel 
generalizations of these methods, which we call the OPG ensemble and the 
RMAVE ensemble. 

4.1. OPG ensemble. Let 5 r , Ft, Ti,...,T m and Wij{h) be as defined in 
previous sections. For each j,k,£, we minimize the objective function 

n 

J2v>ij{h)\fT h (Yi,£) - a - b T (X l -X J )} 2 
i=i 

over (a, b) E M. x MP for each j = 1, . . . , n, k = 1, . . . , m and £ = 1,2. This is 
a least-squares problem and its solution can be written down explicitly, as 

a jk (£)' 



^ Wij Ay(/ p )A5(/ p ) 



J^WijfniY^Aijilp 



b jk (£) 
We then construct the following OPG matrix [Xia et al. (2002)]: 



2 



i=\ k=ij=i 



We use the do eigenvectors of this matrix corresponding to its largest eigen- 
values as an estimate of Sy\x- This estimate shares the desirable property 
of OPG. Numerically, all it needs is the calculation of least squares estimate 
and principal components, none of which involves numerical optimization. 
As such it is very easy to compute and does not run into local minimum 
problem, making it an ideal initial estimate for MAVE ensemble. 

4.2. RMAVE ensemble. The idea of RMAVE is to use an existing con- 
sistent estimate of ft to reduce the dimension of the kernel function, so that 
smoothing is carried out over a do-dimensional, rather than a p-dimensional 
subspace. When do is small, this can mitigate the effect of the curse of di- 
mensionality. In particular, when do < 3, it achieves the -y/n-convergence 
rate. 

Let H h :R d ° — > R+ be the d -dimensional kernel function h~ do K (\\v\\ / h) , 
where v is a do-dimensional vector. We minimize the objective function 

(14) X>,[/T fe (F^) - a jk (£) - b jk (£) T B T (X l - X^fH^B 1 \X t - X,-)], 

where the summation is over the indices in (13). Notice that, if we fix the B 
in the kernel H^, then the objective function is similar to MAVE, and can 
be computed by iterations between two least squares problems, as described 
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in Section 3. We can then substitute the updated B and repeat the process 
until convergence. The algorithm for RMAVE ensemble is summarized as 
follows. 

Let Bq be an initial estimate of Sy\x- For example, we can use the MAVE 
ensemble to calculate the initial estimate. Set ho oc n~ 1 ' (p+ 4 ) and r = 1. 

(1) At step r, let h r = max{?/i r _i,/i}, where <^ G (1/2,1) and h = hoX 
n -i/(do+4)_ N t e that h r is a decreasing sequence that converges to H. So h 
is the final bandwidth. The purpose of starting with a wider bandwidth 
and narrowing it gradually is to avoid being trapped in a local minimum 
at an early stage, as well as to achieve a faster rate of consistency. The 
proportionality constant ho can be selected by the rules as suggested by 
Scott (1992). Let 

n 

Vij {h r ) = H hr [B^(X t - X 3 )]/Y,H hr [B^(X t - X 3 )}. 



i=l 



Let 



p jr = p(n- 1 J2Hh r [(BV) T (X i -X j )} 



\ i=l / 

where p : M. — > M. is as defined in Section 3. 

(2) Use the two-stage iteration procedure described in Section 3, 
with Wij(h) and pj therein replaced by Vij{h r ) and pj r , respectively, to com- 
pute B^ r >. Note that Wij in Section 3 is computed from a p-dimensional 
kernel, whereas Wij(h r ) here is computed from a <io-dimensional kernel. 

(3) Standardize B^ r > so that it is a semiorthogonal matrix. That is, let 

fiWf- B^[B^ r) (B^) T r 1/2 - 

(4) If \\B( r \B^) T - I?('- 1 )(i3( , - 1 >) T || is less than a preassigned small 

number, say 10 -6 , then stop and set B = B r . Otherwise set r ^— r + 1 and 
return to 1. 

5. Order determination and choices of 5 r . In describing the foregoing 
algorithms we have assumed do, the dimension of the central subspace, to be 
known. In practice this dimension must also be estimated. We now propose 
a cross validation method to estimate do- Let B be the estimate of Sy\x f° r 
a fixed working dimension d. Then the leave-one-out fitted value of fx k (Yj , () , 
for j = 1, . . . ,n, k = 1, . . . ,m and £ = 1,2, is 

p k3 id,£)=J2^h[B T (X i -X j )]f Tk (Y i ,£)/J2 K h[B T (X i -X j )]. 
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The corresponding cross validation value is 

£=l k=lj=l 

To include the trivial case of d = 0, we define /*;y (0, £) to be 

so that CV(d) is defined for all d = 0,...,p. The structural dimension do is 
estimated by 

do = argmin{CV(d) :d = 0, . . . ,p}. 

As we have mentioned in Section 2, there are many possible choices for 5- 
In this paper we pay special attention to two families: the family deter- 
mined by the characteristic function, as discussed in Example 2.5, and the 
family that corresponds to the Box-Cox transformations, as discussed in 
Example 2.4. That is, 

$C = {e LtTy :t£R}, 3 B = {ff.t£R}, 

where ft is as defined in (7). An advantage of the family 5c is that its 
members are bounded functions, and as such are relatively robust against 
the outliers in Y. Moreover, it requires virtually no condition on the distri- 
bution of Y. Also note that when t ranges over M s , the function e bt y fully 
recovers the joint information of the random vector Y . In this respect the 
ensemble estimators are akin to Projective Resampling [Li, Wen and Zhu 
(2008)]. However, here the univariate and multivariate responses are treated 
in a unified manner: we simply replace e Lty by e Lt v , whereas in projective re- 
sampling the multivariate response is treated differently from the univariate 
response. 

The family 5b requires Y to be nonnegative. When Y is not nonnegative, 
we make the transformation Yj — min{Y"i, . . . , Y n } + 0.5 before applying the 
Box-Cox transformation. An advantage of this family is that often a few 
fixed functions in 5b would do a reasonably good job. In our simulation 
studies we have used t £ {—2, —1.5, — 1, —0.5,0,0.5, 1, 1.5,2}, as one typically 
uses for Box-Cox transformation. Note, however, if one uses such a finite, 
fixed set, then the corresponding 5 b is not guaranteed to characterize the 
central subspace, unless the distribution of Y satisfies some special condi- 
tions. Alternative transformations such as those proposed by Manly (1976), 
John and Draper (1980) and Bickel and Doksum (1981), for instance, that 
do not require Y to be positive may be used to form different family 5- 

Henceforth we indicate an ensemble estimator based on a family 5 by 
attaching-5 to the name of the original estimator, such as MAVE-5c or 
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RMAVE-Jg. To implement RMAVE-^c, we modify the code for the sliced 
regression in Wang and Xia (2008) based on a gradual descending algorithm; 
the random vectors T\, . . . , T m are an independent sample from iV(0, I s ). To 
implement RMAVE-^b, we adopt the code for RMAVE by Xia et al. (2002) 
and use the fixed set of t mentioned earlier. 

6. Consistency and convergence rate. In this section we investigate the 
asymptotic behavior of RMAVE ensemble based on 5c . We will study the 
convergence rate, assuming the structural dimension do is known, and then 
the consistency of the estimator of efo • The asymptotic analysis proceeds in 
two steps. In the first step (Theorem 6.1), we establish the convergence rate 
for a fixed set of functions {ft x , ■ ■ ■ ,ft m } i n #■ I n the second step (Theo- 
rem 6.2), we investigate the asymptotic behavior when m — > oo. The first 
step is not fundamentally different from the asymptotic results for DMAVE 
and SR as developed in Xia (2007) and Wang and Xia (2008). We have 
therefore relegated the proof to an external Appendix. The second step 
is a novel development and is presented in detail. Although here we only 
consider RMAVE-Jc we have no doubt that the development can be ex- 
tended to other characterizing families. For any finite set {ti, . ■ ■ , t m } C fir, 
let B(t±, . . . , t m ) denote the RMAVE-^c estimator described in Section 4.2, 
and let B(t\, . . . ,t m ) be a basis matrix of spa,n{Smf t /Y)\x] 'i = l,---,m} 
and B be a generic matrix with p rows. Without loss of generality, assume 
these matrices to be semiorthogonal. 

We need to make the following regularity assumptions, which are similar 
to those made in Xia (2007) and Wang and Xia (2008). 

(CI) Marginal distribution of X: The random vector X has a bounded 
support; its density function g(x) has a bounded second derivative; the func- 
tions 

(it, B) ^ E(X\B T X = u), (u,B) ^ E(XX T \B T X = u) 

have bounded derivatives for u G IR'' and B G {B : \\BB T — BqBq\\ < c}, 
where c > 0. 

(C2) Conditional distribution function ofY given B T X: The conditional 
density function g(y\u) of Y given B X has a bounded fourth-order deriva- 
tive with respect to x and u as B is in a small neighbor of Bq. 

(C3) Identifiability of minimum: For any semiorthogonal matrix B £ 
R pxrf , any constant c > and a set {t\, . . . ,t rn } C f^, 

2 m 

(C4) Kernel function: The function Kq is a symmetric univariate density 
with bounded second derivative and a compact support. 
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(C5) Bandwidth: For a working dimension d, the bandwidths {h r :r = 
0, 1, . . .} satisfy ho oc n -1 ' ^ >+ ', /i r = max{<j/i r _i, /i} with 1/2 < ^ < 1 and 

^ ocn -l/(rf+4)_ 

The following theorem gives the convergence rate of RMAVE-Jc for 
a fixed set of functions in $ and a fixed do- Let d(t\, . . . ,t m ) be the di- 
mension of the space spanned by {<->E[/ t .(y)|x] :* = 1, - - - , w}. 

Theorem 6.1. Suppose conditions (CI), (C2), (C4) and (C5) are safe- 
/ied, (C3) holds for {ti, ... ,t m } (^ Qt and set d = d(ti,...,t m ). Then, as 

n — >• oo, 

||B(ti,. ..,t m )B (ti,. . . ,t m ) — B(ti, . . . ,t m )B (h, . . . ,t m )\\ 

(15) 

= P [h A + log n/(nh do ) + n~ 1/2 }. 

Proof. Let 

(16) h(t 1 ,...,t m )cxn- l /W tl '~' t ^ +4 l 

Then, by arguments similar to those used in Xia et al. (2002), Xia (2007) 
and Wang and Xia (2008), it can be shown that 

\\B(ti,. . . ,t m )B (£i,. . . ,t m ) - B(ti, . . . ,t m )B (£i,. . . ,t m )|| 

= P (n 4 (t!, ...,t m ) + logn/[nh d ^-^ (h,..., t m )} + n- 1 ' 2 ). 

See the external Appendix. By (16), the right-hand side is of the order 
O p ( n -4/[d(t 1) ...,t m )+4] + (logn)n -4/[d(t 1 ,...,t m )+4] +rr l/2). 

Since the function in Op(-) is increasing in d(t\, . . . ,d m ), which is no more 
than do, relation (15) holds. □ 

Note that we are interested in S Y \x instead of spanLS E rj ( (y)|xi : * = 
1, . . . ,m}. The next theorem shows that, under the conditions no stronger 
than Theorem 6.1, RMAVE-Jc recovers the central subspace at the same 
rate as does RMAVE itself. 

Theorem 6.2. Suppose that conditions (C1)-(C5) hold, that T\, . . . ,T m 
are an independent sample from 0,j- and that they are independent of (X\ , Y"i), 
. . . , (X n ,Y n ). Let B{T U ...,T m ) be the RMAVE-$ C estimator of B(T U ..., 
T m ). Then, for any e > 0, 

n7 , r ,. n f \\B(Ti, ■ ■ .,T m )B T (T 1: . . . ,T m ) - B Bj\\ ^ \ 

(17) hm hm PI —— .. . . , ■ _ 1/9 ^>e = 0. 



■m— >oo n— >oo 



ft 4 + logn/(nh d °) + n -l/2 
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Proof. By Theorem 2.2, we have that 5(T\, . . . ,T m ) becomes for suf- 
ficiently large to. Consequently, P(liminf rn _ i>00 {<5(Ti, . . . ,T m ) = 0}) = 1. By 
Fatou's lemma, 

(18) liminf P(6(T 1} . . . ,T m ) = 0) > P ( lim inf {5{T X , . ..,T m ) = 0}) = 1. 

Thus we see that the bias term converges to infinitely fast. 
Next, let a n = h i + logn/(nh do ) + n -1 / 2 . We have 

a- 1 \\(B(T 1 ,...,T m )B T (T l ,...,T m )-B Bj)\\ 
< a^ 1 5 n (Ti, . . . , T m ) + a^ 1 5(T i , . . . , T m ), 
where 5(Ti, . . . ,T m ) is as defined before and 
8n(Ti, ■ ■ ■ ,T m ) 

= \\B(T lt . . . ,T m )5 T (T!, . .. ,T m ) - B(T U . . .,T m )B T (T 1 , ...,T m )\\. 
Since a n ^ 0, we have, by (18), 

PK^CZi, . . . , T m ) / 0) = P(S(T U . . . ,T m ) jt 0) -> as m -> oo. 

Since, despite its appearance, the term on the left does not depend on n, 
the above limit can be rewritten as 

(19) lim lim P(a- 1 <5(T 1 , . . . ,T m ) jt 0) = 0. 



in— >oo n— >co 



By Theorem 6.1, for a fixed set ti,...,t m , lim re _ K>0 P(a~ # n (ti, . . . ,t m ) > 
e) = 0. But because T±, . . . , T m and (Xi, Yi), . . . , (X n , Y n ) are independent, 
this implies 

lim P(a~ 1 6 n (T 1 ,...,T m ) > e\T\ = t 1 ,...,T m = t m ) =0. 

n— >oo 

By the dominated convergence theorem, linin^oo P(a~ 1 5 n (Ti, . . . , T rn ) > e) = 
0, and hence 

(20) lim lim Pia^SniTx, . . . ,T m ) > e) = 0. 

m->oo n->oo 

Now combine (19) and (20) to prove (17). □ 

Theorem 6.2 implies that if do < 3, then -^/n-consistency can be achieved 
by taking hex. n -1 /( fi fo+ 4 ). 

Next, we establish the consistency of the estimator of do described in 
Section 5. Let d(t±, . . . , t m ) be the cross validation estimator of d(t\, . . . , t m ). 
The proof of the following lemma can be found in the external Appendix. 

Lemma 6.1. Suppose that conditions (CI), (C2), (C4) and (C5) hold, 
(C3) is satisfied for {t\, . . . ,t m } C Ot and the bandwidth h~d used for different 
dimension d satisfies Kd oc n~ 1 ' ( rf + 4 ). Then we have 

lim P(d(ti,...,t m ) =d(ti,...,t m )) = 1. 
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We now consider the convergence to the structural dimension do as m — > 
oo. Let d(T\, . . . ,T m ) be the cross validation estimator of d(7\, . . . ,T m ), 
which is the dimension for B(Ti, . . . , T m ). 

Theorem 6.3. Under the assumptions in Theorem 6.2 we have 
lim lim P{d(T 1 , . . . ,T m ) = d ) = 1. 

m— >oo n— >oo 

Proof. Following the same argument that leads to (18) in the proof of 
Theorem 6.2, we can show that 

(21) lim P(d(T 1} . ..,T m ) = d ) = lim lim P(d(T l: . . . ,T m ) = d ) = 1. 

m— >oo m-»oon->oo 

As in the proof of Theorem 2.2, since span[i?(Ti, . . . ,T m )] C span(-Bo), by 
Lemma 2.2, event (9) occurs if and only if 5{Ti, . . . ,T m ) becomes for suf- 
ficiently large in. By the definition of S(T±,. . . ,T m ) in (8), for sufficiently 
large m, d(Ti, . . . ,T m ) = d . Consequently, P(liminf m _ +00 {d(Ti,. . . ,T m ) = 
do}) = 1. By Fatou's lemma, 

liminf P(d(Ti, . . . ,T m ) = do) > pfliminf{d(Ti, . . . ,T m ) = d }) = 1. 

Thus linim^oo P{d{T\ , . . . , T m ) = do) = 1. Since d(Ti , . . . , T m ) does not de- 
pend on n, (21) holds. 

Since T\, . . . , T m are independent of (Xx,Y\), . . . , (X n , Y n ), Lemma 6.1 im- 
plies that lim n ^oo P(d(Ti, ..., T m ) = d(T x , ■■■, T m )\T\ = t 1 ,...,T m = t m ) = 1. 
By the dominated convergence theorem, limn^oo P(d(Ti, . . . , T m ) = d(Ti, . . . , 
Tm)) = 1) which implies 

(22) lim lim P(d(T u . . . ,T m ) = d(7i, ...,T m )) = 1. 

m->oo n-+oo 

The desired assertion follows from (21) and (22). □ 

Theorem 6.3 confirms that the proposed CV criterion is indeed consistent 
in selecting the dimension of the central subspace. 

7. Simulation studies. In this section we compare the ensemble estima- 
tors, RMAVE-^c and RMAVE-^s, with existing methods such as SIR, 
SAVE, DMAVE, RMAVE and SR. For an estimate B of B Q , both assumed 
to be semiorthogonal without loss of generality, the estimation error is mea- 
sured by A(B,Bq) = ||-B-B T — BqBq ||, where || • || is the operator norm [Li, 
Zha and Chiaromonte (2005)]. For each setting, 100 replicates of the data 
are generated, unless stated otherwise. 

Example 7.1. The purpose of this example to demonstrate that the 
performance of RMAVE-^c is very stable as the ensemble size m varies. Let 

Yi = cos(2Aji) -cos(Aj 2 ) + 0.2ej, i = l,...,n, 
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Fig. 1. Averaged A(B, Bo) versus ensemble size m for Example 7.1. Left panel: n = 200. 
Right panel: n — 400 . 



where £j is a standard normal random variable, and X{ is a random vector 
in R 10 . The random vector Xi is generated by N(0,T,x), where the (i,j)th 
entry of T, x is 0.5l i-J 'l. For this model d = 2 and B = (ei,e 2 ) G R 10x2 , 
where ej is a vector whose ith entry is 1 and other entries are 0. The model 
was used in Li (1992) and Wang and Xia (2008). 

In Figure 1 we plot the averages of A(B,Bq) over the 100 simulated 
samples versus different ensemble sizes m, ranging from 5 to 50. The left 
panel corresponds to n = 200, and the right panel corresponds to n = 400. 
We see that the average error is quite stable as m varies: for n = 200 it is 
between 0.1806 and 0.1922, and for n = 400 it is between 0.1066 and 0.1086. 

Example 7.2. The following regression model is a modification of Ex- 
ample 3 of Wang and Xia (2008): 

Xi\ 



Y; 



+ XizEi, 



0.5 + (X i2 + 1.5) 2 

where e, and X, are generated as in Example 7.1 with n = 400. In this case 
4 = 3 and B = (e x , e 2 , e 3 ) G M px3 . We use m = 15 for RMAVE-^c and the 
number of slices H = 5 for SR, SIR and SAVE. Table 1 below indicates that 
RMAVE-£c is the best performer, followed by DMAVE and SR. 



Table 1 

Comparisons for (Example 7.2). Entries are mean ± standard error of A(Bq,Bq) 

calculated from 100 simulation samples 



p RMAVE-£c 



SR 



RMAVE-£b RMAVE DMAVE 



SIR 



SAVE 



10 


0.388 


0.513 


0.878 


0.750 


0.409 


0.851 


0.903 




±0.134 


±0.192 


±0.136 


±0.170 


±0.153 


±0.115 


±0.117 


20 


0.638 


0.844 


0.954 


0.880 


0.719 


0.947 


0.977 




±0.150 


±0.146 


±0.060 


±0.111 


±0.159 


±0.055 


±0.034 
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Table 2 

Comparisons for (Example 7.3). Entries are mean± standard error of A(Bq,Bo) 

calculated from 100 simulation samples 



p 


RMAVE-Sc 


SR 


RMAVE-S B 


RMAVE 


DMAVE 


SIR 


SAVE 


10 


0.101 


0.149 


0.156 


0.149 


0.100 


0.256 


0.294 




±0.023 


±0.044 


±0.047 


±0.041 


±0.019 


±0.048 


±0.063 


20 


0.155 


0.246 


0.243 


0.243 


0.148 


0.339 


0.510 




±0.028 


±0.059 


±0.055 


±0.098 


±0.020 


±0.046 


±0.096 



Example 7.3. The following model is taken from Zhu and Zeng (2006), 
Example 3: 

where e, is a standard normal random variable, a = 0.2 and Xi ~ N(0,I p ). 
The regression coefficients are fa = e\ + • • • + e^ and fa = e P -3 + ■ ■■ + e p . 
Thus we have do = 2 and Bq = (fa, fa). The specifications for n,m,H are 
the same as Example 7.2. Table 2 below reports the results. In this case 
DMAVE is the top performer, with RMAVE-^c as a close second. 

Example 7.4. This example is to investigate the effectiveness of the 
CV criterion for order determination introduced in Section 5, as used in 
conjunction with RMAVE-^cs in the spirit similar to Example 4 of Wang 
and Xia (2008). We consider the following three models: 

Model A: Y t = (Xj fa~ l + 0.2^, fa = e x + • • • + e 4 , 

Model B: Y { = cos(2Xii) - cos(X i2 ) + 0.2^, 

Model C: Yi = X a /[0.5 + (X i2 + 1.5) 2 ] + Xf 3 £i, 

where X is generated as in Example 7.1. We take p = 10, m = 15. Table 3 
shows that, as the sample size n increases, the percentages of correctly iden- 
tified dimensions quickly approach to 100% for all three models, which is 
comparable with the results in Wang and Xia (2008). Our results for the first 
two models show substantial improvement over the corresponding results in 

Table 3 

Percentage of correctly estimated do using the CV 

criterion combined with RMAVE-^c (Example 7.4) 

Model n = 100 n = 200 n = 400 

A 0.95 1.00 1.00 

B 0.83 1.00 1.00 

C 0.39 0.60 0.79 



20 X. YIN AND B. LI 

Table 4 

Comparisons for (Example 7.5). Entries are mean± standard error of A(Bq,Bo) 

calculated from 100 simulation samples 

p Model RMAVE-^c SR RMAVE-Sb RMAVE DMAVE SIR SAVE 



10 D 


0.052 


0.082 


0.994 


0.983 


0.313 


0.306 


0.243 




±0.014 


±0.024 


±0.008 


±0.067 


±0.412 


±0.072 


±0.075 


E 


0.053 


0.059 


0.609 


0.058 


0.054 


0.141 


0.154 




±0.016 


±0.015 


±0.173 


±0.015 


±0.014 


±0.037 


±0.040 


F 


0.163 


0.178 


0.869 


0.798 


0.208 


0.225 


0.230 




±0.043 


±0.055 


±0.979 


±0.141 


±0.073 


±0.056 


±0.067 


G 


0.198 


0.217 


0.387 


0.349 


0.931 


0.242 


0.601 




±0.053 


±0.059 


±0.178 


±0.109 


±0.084 


±0.058 


±0.230 


20 D 


0.080 


0.122 


0.955 


0.996 


0.501 


0.399 


0.363 




±0.016 


±0.023 


±0.006 


±0.006 


±0.459 


±0.070 


±0.068 


E 


0.068 


0.092 


0.694 


0.085 


0.075 


0.179 


0.236 




±0.015 


±0.018 


±0.200 


±0.017 


±0.014 


±0.030 


±0.061 


F 


0.250 


0.281 


0.927 


0.879 


0.323 


0.303 


0.414 




±0.048 


±0.052 


±0.058 


±0.103 


±0.080 


±0.053 


±0.083 


G 


0.299 


0.321 


0.507 


0.535 


0.973 


0.312 


0.957 




±0.048 


±0.054 


±0.146 


±0.166 


±0.034 


±0.050 


±0.060 



Wang and Xia (2008) for n = 100 and n = 200. A possible explanation of 
this improvement is that RMAVE- 3c allows us to make repeated use of the 
sample of responses, with each repetition exploring a different aspect of the 
central subspace. In other words the ensemble approach makes fuller use of 
the data than dividing them into slices. 

Example 7.5. The four models in this example are the same as those 
used in Example 5 of Wang and Xia (2008): 

Model D: Y t = {Xj ft)" 1 + 0.2e i? = e x + • • • + e 4 ; 

Model E: Yi = 0.1(Xj (3 + Eif, /3 = ei + ■ • ■ + e 4 ; 

Model F: Yi = exp(A 4 T /3) x £;, ft = e\ + 0.5e 2 + e 3 ; 

Model G: Y { = sign(2X a +e a )x log|2A i2 +4 + e i2 \. 

Here, X,n,m,H are the same as specified in Example 7.1. Table 4 below 
reports the result for n = 400. We see that RMAVE-^c again consistently 
outperforms other estimators in all four models, though SR is quite close to 
it in some cases. 

Example 7.6. As we noted before, the family $c is particularly use- 
ful for recovering directions in the Sy\x that do not belong to Se(y\x)> 
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Table 5 

Comparisons for (Example 7.6). Entries are mean± standard error of A(Bq,Bo) 

calculated from 100 simulation samples 



p 


RMAVE-Sc 


SR 


RMAVE-Sb 


RMAVE 


DMAVE 


SIR 


SAVE 


10 


0.098 


0.114 


0.128 


0.103 


0.132 


0.434 


0.310 




±0.023 


±0.031 


±0.035 


±0.026 


±0.031 


±0.134 


±0.086 


20 


0.131 


0.155 


0.180 


0.143 


0.207 


0.590 


0.547 




±0.026 


±0.033 


±0.038 


±0.028 


±0.062 


±0.106 


±0.132 



and when Y contains outliers. This example indicates that in the case 
where S Y \x = Se(y\x) an d Y contains no outliers — conditions favorable to 
RMAVE. Consider the model 

Y t = arcsin(l/(l + |0.5 + X a \)) + 0.2e 4 , 

where £j and X% E R 10 are generated as in Example 7.1. Note that in this 
case both Sy\x an d $e(y\x) are spanned by e±. We take m = 15, n = 400 
and the number of slices for SR, SIR and SAVE equal to 5. However, Table 5 
indicates that RMAVE-fc is slightly better than RMAVE. 

Example 7.7. The main point of this example is to demonstrate numer- 
ically the -^/n-consistency of RMAVE-5c ; which we have shown analytically 
in Section 6 for cIq < 3. A secondary point is to reconfirm the stability of 
this estimator against the change of ensemble size m, for a wide range of 
sample sizes n, which we have demonstrated in Example 7.1 for two sample 
sizes (n = 200,400). This second point also provides us intuition about the 
double limits, lim m _ i . 00 lim n _5. 00 , we took in Section 6. 

Here we adopt the approach of Wang and Xia (2008), Example 8. We use 
model D in Example 7.5. In Figure 2 we plot the averaged A(B, Bq) against 
l/y/n for m= 15,30,50,100. The value of l/\/n ranges from 0.045 to 0.1, 
corresponding to sample sizes n = 100, 200, 300, 400, 500 in reverse order. We 
can see that the curves are roughly straight lines passing through the origin, 
which confirms the -y/n-consistency. We also see that the performance of 
RMAVE-^c" is very stable as m changes, across different sample sizes. 

Example 7.8. Finally, we investigate the performance of $c f° r mul- 
tivariate responses. The model is taken from Li, Wen and Zhu (2008), 
Model 4.4. Here p = 6, s = 5. The predictor A, is generated from N(0,Iq). 
The error £j is generated from N(0, £), where S = diag(Si, £2), i n which 

1 -1/2 
-1/2 1/2 



1/2 











1/3 











1/4 
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0.04 0.05 0.06 0.07 0.08 0.09 



Fig. 2. Numerical demonstration of ^/n- consistency for RMAVE-^c as described in 
Example 7.7. 

The 5-dimensional response random vector Y is generated as: 
Y a = X l2 + 3X i2 /[0.5 + (X a + 1.5) 2 ] + e u 
Y i2 = X ll + e°- 5X ^+e i2 , 

Y{3 = Xn + Xi2 + £i3, 
Jj4 = £j4, 
Yib = £j5- 

For a fair comparison, we use the Frobenius norm instead of the operator 
norm for A(B, Bq), the former of which was used by Li, Wen and Zhu (2008). 
Table 6 shows the results for n = 100, averaged over 1,000 simulated samples. 
The columns PR-SIR and PR-RMAVE refer to projective resampling used 
in conjunction with the SIR and RMAVE, respectively. See Li, Wen and Zhu 
(2008). The numbers in the PR-SIR column is taken from that paper. For 
RMAVE-Jc we use m = 15,000 random directions; for PR-RMAVE, we use 
1,000 random directions. 

In this case PR-RMAVE performs the best among the three estimators. 
Note that in this example the central subspace and the central mean sub- 



Table 6 
Comparison of different estimators for 
multivariate Y (Example 7.8, n= 100,) 



PR-SIR 



RMAVE-3c 



PR-RMAVE 



0.276 ±0.1 



0.248 ±0.088 



0.206 ±0.071 
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space coincide, which is the most favorable scenario for methods derived 
from RMAVE. 

8. Discussion. In this paper we introduce a general method for combin- 
ing estimators of a family of central mean subspaces into a single estimator 
of the central subspace using the MAVE-type procedures as basic estimators 
for the central mean subspaces. Different combinations of the characterizing 
families and MAVE-type procedures result in a class of new estimators of 
the central subspace, which we call the ensemble estimators. Ensemble esti- 
mators exhaustively estimate the central subspace and are relatively easy to 
compute. The algorithm for estimation can be broken down into iterations 
of quadratic optimization steps, whose solutions have the least-square form. 
The ensemble estimators do not require special treatment for multivariate 
responses, because the characterizing nature of 5 automatically takes into 
account the multivariate information in the response. Ensemble estimators 
allow repeated use of the available sample of responses, and by doing so en- 
hance the estimation accuracy. They do not require dividing the sample into 
slices, which not only simplifies the operation but also avoid sensitivity to 
the number of slices. Ensemble estimators have the same convergence rates 
as their corresponding MAVE-type estimators. In particular, the RMAVE 
ensemble has the \/n-rate when the structural dimension do is no more 
than 4. 

An important problem is the choice of 5- At this stage we do not yet have 
a good theory to generate a universal criterion that can work across families. 
One theoretical difficulty in devising a general criterion to choose among dif- 
ferent families 5 is that different transformations of the response result in 
different scales that cannot be meaningfully compared. For example, if we 
use cross validation of prediction errors to choose among families, then we 
face the problem that the prediction errors in different families have differ- 
ent meanings. At this stage, we suspect that any general criterion capable 
of choosing among different families must be intrinsic to the probabilistic 
relation between X and Y, as reflected in the conditional distribution of Y 
given X , rather than specific to any form of transformation of the response. 

Our empirical knowledge seems to indicate that bounded transformations, 
such as 5c an d SR, are preferable to unbounded transformations, such as 
power transformation (Example 2.1) and Box-Cox transformation (Exam- 
ple 2.4), especially when the model permits extreme values in the response. 
A bounded characterizing family of transformations serve the dual purposes 
of comprehensively describing the central subspace and decreasing the lever- 
age of the extreme response values. In addition, the transformations in 5c 
make full reuse of the data at each resampling. In this respect it is rather 
similar to the bootstrap, except that the resampling is done by random pro- 
jection. Indeed, this is the very spirit of ensemble estimator we would like to 
advocate in this paper, and it is this aspect that distinguishes the ensemble 
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estimators from other sufficient dimension reduction estimators. Finally, in 
the majority of examples we considered in Section 7, the 5c- ensem ble es- 
timator consistently outperforms other methods. In light of these empirical 
evidences, we regard the 5c-f amu y as the overall best performer among the 
families we considered. 

The choice of number m is also important. Theorem 2.2 indicates that 
a large enough m will guarantee the exhaustive recovering of the central 
subspace, regardless of the characterizing family used. In practice, however, 
different families require different choices of m. For the family 5c> we rec- 
ommend to choose m as large as computationally feasible, because adding 
a new function in $c amounts to reusing the data one more time. Since the 
functions in $c are bounded and smooth, the ensemble estimator is stable 
as more functions are included. 

The general formulation of the ensemble estimators also provides a syn- 
thesis and fresh insights for many recently developed methods. In particular, 
it unifies the central mean subspace [Cook and Li (2002)], the central mo- 
ment subspace [Yin and Cook (2002)], Fourier transform estimators [Zhu 
and Zeng (2006)], dMAVE [Xia (2007)] and sliced regression [Wang and Xia 
(2008)] in a coherent system. Although in this paper we have focussed on 
MAVE ensemble and its variations, the ensemble approach can potentially 
be combined with any estimator of the central mean subspaces to recover 
the central subspace, such as the OLS [Li and Duan (1989), Duan and Li 
(1991)], pHd [Li (1992), Cook (1998)] and Iterative Hessian Transformations 
[Cook and Li (2002, 2004)]. 

Finally, the ensemble approach can also be used with other character- 
izing families which we cannot fully explore within this paper, but which 
may be especially useful for some applications. One example is the wavelet 
basis, such as the Haar basis briefly described in Example 2.6. Such families 
are highly effective for handling response variables that have sharp discon- 
tinuities, which frequently arise in image analysis and pattern recognition 
[Donoho and Johnstone (1994)]. We leave further exploration of these pos- 
sibilities to future research. 
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